# Automation classification
## What's this?
This package creates our automation patent classification.
It provides our non-trivial keyword search on the fulltext of all EP patents.
It also provides tools to classify Patstat patents based on CPC/IPC codes.

This package runs in three parts:
1. building the index
2. collecting features of patents
3. classifying and applying the classification to Patstat.

Steps 1 and 2 require a hard drive with EPO backfile patent data. We provide the output of step 2 (appln_features.csv). Step 3 requires patent data from Patstat; we provide its output in the form of a list of classified applications and c/ipc codes.

## Files in order of execution
-  classification.ps1 | loads the pipenv and Stata dependencies and calls the respective Python files. Serves as the interface.
-  EPO_Database.py | contains an interface to the EP backfile. Takes care of building an index of the patents and accessing the patents by various means. Also contains the code for the
   keyword search, which outputs the matching keywords for each EP patent.
-  ipcclassifier.py | Uses information on IPC / CPC codes to build the automation classification.


## Patstat files
If you want to run the classification part, you need the following files from the Patstat workflow, placed inside ./classification/patstat:

- appln_year.csv: csv with columns appln_nr, ipc4 for the application number and the application year (1:1)
- appln_ipc6.csv: csv with columns appln_nr, ipc4 for the application number and ipc6 code (1:n)
- appln_ipc4.csv: csv with columns appln_nr, ipc4 for the application number and ipc4 code (1:n)
- docdb_family_id_cipc_codes.csv: csv with columns docdb_family_id,cipc6: a 1:n map from the docdb family id of a patent to its cipc6 code
- ipc_techn_fields.csv
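
Before starting the classification step, it can help to confirm that all of the files listed above are in place. The following is a minimal sketch and not part of the package: the function name `missing_patstat_files` is hypothetical, and the only assumptions are the file names and the ./classification/patstat directory given in this README.

```python
from pathlib import Path

# File names copied from the list above.
REQUIRED_FILES = [
    "appln_year.csv",
    "appln_ipc6.csv",
    "appln_ipc4.csv",
    "docdb_family_id_cipc_codes.csv",
    "ipc_techn_fields.csv",
]


def missing_patstat_files(patstat_dir="./classification/patstat"):
    """Return the required file names that are not present in patstat_dir.

    Hypothetical helper: it only checks that the files exist, not their columns.
    """
    base = Path(patstat_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_patstat_files()
    if missing:
        print("Missing Patstat files:", ", ".join(missing))
    else:
        print("All Patstat files found.")
```

An empty list means every expected file is present; anything else names the files that still need to be produced by the Stata dofiles.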

These files are generated by Stata dofiles after EPO_Database.py has run, provided you have Patstat data. The dofiles are located in the main code directory ./code/assembly/classification:

-   biadic_families.do | flags patent families that are applied for in multiple countries
-   ipc_cpc_codes.do | makes a combined ipc/cpc (cipc) code and maps to applications
-   docdb_families_ipc_codes.do | maps patent families to the combined cipc codes
-   appln_ipc.do | maps cipc to applications in lists that Python created
-   adjusted_citations.do | normalizes citations by technological field
-   fields.do | matches technical fields to patent families

### EPO data

We use A1 and B1 (!) files for our analysis. In addition, EPO_Database.py expects your files to follow a certain structure. Whatever directory you place your files in should be called /EPbackfile. Inside it, we follow a Year/week/kind/document.zip structure. An example using the A4 **(Note: will not work with A4! Purely illustrative)** data available on the EPO website as sample data "EP full text data in a bulk sample data with complimentary EP A4 in XML/PDF":

EPRTBA42022000029001001.zip -> inside you will find a /DOC folder, and inside that /EPNWA4. Here EPRTBA42022000029001001 is the week, 2022 is the year, and EPNWA4 is the kind folder. Create a folder 2022 with a subfolder EPRTBA42022000029001001 for the week, then extract the sample zip (EPNWA4) into that place, so that the files have the following path: ./EPbackfile/2022/EPRTBA42022000029001001/DOC/EPNWA4/{patentname}.zip
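
To check that your files match this layout, you can list the document zips that sit at the expected depth. This is only an illustrative sketch and not how EPO_Database.py builds its index: the function name `iter_documents` is hypothetical, and the glob pattern assumes the ./EPbackfile/{year}/{week}/DOC/{kind}/{patentname}.zip path shown above.

```python
from pathlib import Path


def iter_documents(root):
    """Yield (year, week, kind, zip_path) for each document zip below root.

    Assumes the layout {root}/{year}/{week}/DOC/{kind}/{patentname}.zip.
    Hypothetical helper for checking the folder structure only.
    """
    root = Path(root)
    for zip_path in sorted(root.glob("*/*/DOC/*/*.zip")):
        kind_dir = zip_path.parent          # e.g. EPNWA4
        week_dir = kind_dir.parent.parent   # skip the DOC folder
        year_dir = week_dir.parent          # e.g. 2022
        yield year_dir.name, week_dir.name, kind_dir.name, zip_path


if __name__ == "__main__":
    for year, week, kind, path in iter_documents("./EPbackfile"):
        print(year, week, kind, path.name)
```

If nothing is printed, the zips are probably nested one level too deep or too shallow relative to the path above.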

## Usage
Run the master file in the main directory, select classification, and select your available data.
The classification.ps1 file will then be called, where you can select in more detail which parts to run and for which periods you would like to classify.

